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Summary 



A dynamic treatment regime effectively incorporates both accrued information and 
long-term effects of treatment from specially designed clinical trials. As these become 
more and more popular in conjunction with longitudinal data from clinical studies, the 
development of statistical inference for optimal dynamic treatment regimes is a high 
priority. This is very challenging due to the difficulties arising form non-regularities 
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^ '. in the treatment effect parameters. In this paper, we propose a new reinforcement 

in 

regularities can be resolved and valid statistical inference established. We also propose 



learning framework called penalized Q-learning (PQ-learning), under which the non- 



a new statistical procedure — individual selection — and corresponding methods for 
incorporating individual selection within PQ-learning. Extensive numerical studies 
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are presented which compare the proposed methods with existing methods, under a 
variety of non-regular scenarios, and demonstrate that the proposed approach is both 
inferentially and computationally superior. The proposed method is demonstrated 
with the data from a depression clinical trial study 

Keywords: Dynamic treatment regime; individual selection; multi-stage; non-regularity; 
penalized Q-learning; Q-learning; shrinkage; two-stage procedures. 



1 Introduction 



Developing effective therapeutic regimens for diseases is one of the essential goals of 
medical research. Two major design and analysis challenges in this effort are: taking 
accrued information into account in clinical trial designs and effectively incorporating 
long-term benefits and risks of treatment due to delayed effects. One of the most 
promising approaches to deal with these two challenges has been recently referred to 
as "dynamic treatment regimes" or "adaptive treatment strategies" (Murphy, 2003), 
and the method has been utilized in a number of settings, such as drug and alcohol 
dependency studies. 

Reinforcement learning — one of the primary tools used in developing dynamic 
treatment regimes — is a sub-area of machine learning, where the l earning behavior 



is thr ough trial-and-error interactions with a dynamic environment (IKaelbling et al. 



19961 ). Because reinforcement learning techniques have been shown to be effective in 
developing optimal dynamic treatment regimes, the area is attracting increased at- 
tention among statistical researchers. As a recent example, a new approach to cancer 
clinical trials based on the specific area of reinforcement learning called Q-learning, 



has been proposed by 



Zhao et al 



have al so been proposed 



ample, 



Chakraborty et al. 



linear models. Other related literature includes 
qu entist and Bayesi a n) by 



( 120091 ). Extensive statistical estimating methods 
o r opt imal dynamic treatment regimes, including, for ex- 
(120091 ) . who developed a Q-learning framework based on 

ikeliho od-based methods (both fre- 



Thall et al. 



by 



Lunceford et al. 



(2002) 



Murphvl ( 



(2000 



2003); 



2002 



20071) a n d semiparametric m e thods 



Robinsl (120041) . IWahed and Tsiatisl ( 12006 



20041 ) 



Moodie et al 



(120091 ) and iMoodie and Stephens! ( iTo appear! ) . 



In contrast to the substantial body of estimating methods, the development of sta- 
tistical inference for optimal dynamic treatment regimes is very limited and far from 
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ready. This sequential, multi-stage decision making problem is at the intersection of 
machine learnin g, optim i zation and statistical inference and is thus quite challenging. 



As discussed in 



Robins! ( 120041 ). and recognized by many other researchers, the key 



difficulty lies in the fact that the treatment effect parameters at any stage prior to 
the last stage may be non-regular for certain longitudinal distributions of the data, 
where non-regularity in this instance means that the asymptotic distribution of the 
estimator of the treatment effect paramet er does not converge uniformly over the 
parameter space (jChakraborty et all . 120091 ) . 

This non-regularity arises when the optimal last stage treatment is non-unique 
for at least some subjects in the population, causing estimation bias and failure of 
traditional inferential approaches. There have been a number of proposals for correct- 
ing this problem. For example, Moodie and Richardson (2010) proposed a method 
called Zeroing Instead of Plu gging In (ZIPI). Th i s met ho d is also referred to as the 



hard-threshold estimator by 



Chakraborty et al. 



(120091). 



Chakraborty et al 



(120091) 



also proposed a soft-threshold estimator and implemented several kinds of bootstrap 
methods. Both the hard-threshold estimator and the soft-threshold estimator essen- 
tially shrink the "problematic" term to decrease the degree of non-regularity. While 
this intuitively makes sense, there is, however, a lack of theoretical support for these 
methods. Moreover, extensive simulation studies in their associated papers indicate 
that neither hard-thresholding nor soft-thresholding, in conjunction with their boot- 
strap implementation, works uniformly well for all simulation settings. We are there- 
fore motivated to develop improved, asymptotically valid estimation and inference for 
optimal dynamic treatment regimes. 

In this paper, we develop a new reinforcement learning framework for discovering 
optimal dynamic treatment regimes: Penalized Q-learning (abbreviated hereafter as 
PQ-learning). This new backward recursive multistage learning approach can be 
viewed as a penalized version of Q-learning. The major distinction of the proposed 
PQ-learning from traditional Q-learning is in the form of the objective Q-function at 
each stage. While the proposed method shares many of the properties of traditional 
Q-learning, there are at least three significant advantages which we now describe. 

First, the notorious and inevitable non-regularity issue associated with Q-learning 
can be resolved with PQ-learning. At each stage of Q-learning, the maximization 
functional over individual treatments are involved, hence there is at least one non- 
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different iable point over the range of the treatment parameters. If the probability 
mass on this point is positive, i.e., some individuals have no treatment effects, it will 
cause non-ignorable non-regularity issues that will yield failure of existing inferential 
methods. With PQ-learning, all individuals experiencing no treatment effect can be 
identified with probability converging to one, as in the oracle setting. 

Second, we propose effective inferential procedures based on PQ-learning for op- 
timal dynamic treatment regimes. In contrast to existing bootstrap approaches, our 
variance calculations are based on explicit formula and hence are much less time- 
consuming. Thorough theoretical studies and extensive empirical evidence both sup- 
port the validity of the proposed methods. 

Third, since PQ-learning puts a penalty on each individual, it automatically ini- 
tiates another important statistical procedure: individual selection. The purpose 
of individual selection is to select those individuals without treatment effects from 
the population. Successful individual selection, i.e., correctly identifying individuals 
without treatment effects, is the key to resolving the non-regularity problem. 

Besides improving statistical inference, individual selection is itself an important 
task in identifying optimal dynamic treatment regimes and in many biomedical and 
clinical studies. If individuals without treatment effects can be correctly identified, 
then the corresponding components of the history of these individuals potentially 
need not be collected to make decisions using the optimal dynamic treatment regime. 
This could significantly reduce the cost of data collection during implementation 
of the optimal dynamic treatment regime. While the proposed individual selection 
procedure shares some similarities with certain commonly used variable selection 
methods, the approaches are fundamentally different in other ways. These issues 
will be addressed in greater detail in the paper. 

The remainder of the paper is organized as follows. In Section 2, we provide 
a review of statistical problems in reinforcement learning. The proposed penalize 
Q-learning and individual selection procedure are presented in Section 3, where the 
implementation and the statistical properties are discussed in detail. Some empirical 
results are presented in Section 4. We apply the proposed approach to the Sequenced 
Treatment Alternatives to Relieve Depression (STAR*D) clinical trial in Section 5. 
A summary of our findings with a discussion is given in Section 6. Proofs are deferred 
to the Appendix. 
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2 Statistical Problems in Reinforcement Learning 



2.1 Reinforcement Learning 

The basic reinforcement learning procedure involves 

i trying and recording a sequence of actions, 

ii statistically estimating the relationship between the actions and consequences 
and 

iii choosing the action that results in the most desirable consequence based on 
statistical decisions. 



A det ailed introduction of reinforcement learning can be found in 



Sutton and Barto 



( 119981 ). In a reinforcement learning based clinical trial design, we choose a sequence 
of actions applied to the patient and the environment responds to those actions and 
provides feedback. Here, "environment" refers to the system consisting of the human 
body and related additional sources of measurements. Specifically, we use random 
variable S to denote the set of environmental states and A to denote the set of possible 
actions. For example, states can represent individual patient prognostic factors and 
actions can represent different treatment agents or dose levels. Their time-dependent 
versions are denoted S t = {S , Si, . . . S t } and A t = {A , Ai, . . . , A t }, respectively. We 
use the corresponding lower case to denote a realization of these random variables 
and random vectors. The time points correspond to clinical decision points in the 
course of patient treatment. After each time step t, as a consequence of a patient's 
treatment, the patient receives a numerical reward, denoted with a random variable 
Rt, which can be represented as a function R(-) of the current state St, current action 
A t and next state St+i, that is, Rt = R{St, A t , St+i). We also denote a realization of 
R t as r t = R(s t ,a. t ,s t+ i). 

Within a reinforcement learning framework, an exploration policy ir can be rep- 
resented as 7Tt(st, at_i) = at, a mapping from state s t and action a t _i to the set of 
possible actions. In the clinical setting, a policy is a treatment regimen or a rule. 
Since our goal in the clinical setting focuses on discovering the treatment that yields 
a maximized long term reward for the patient, i.e., an optimal personalized treatment, 
thus seeking the optimal policy that maximizes the expectations of the total rewards 
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over the time trajectories is a major goal of clinical research. Accordingly, we define 
a value function as a function of states and actions: 

u 



Vt(s t ,a t _i) 



E, 



k=0 



7 r t +k 



s 4 , A 



t-i 



a t _i 



where the discount rate 7 G [0,1] can be interpreted as a control to balance a patient's 
immediate reward and future rewards. The value function measures the success of 
the treatment policy n. Letting II denote the set of all policy candidates, the optimal 
value function can be defined as V t *(s t , a t _i) = max Tgn Vt( s t, a t-i)- 

The value functions used in reinforcement learning typically satisfy the recursive 
Bellman equation (Bellman, 1957), which forces the optimal policy ir* to satisfy 



7Tt(s t , a t _i) G argmax at E r t + jV* +1 (S t +i, A 



St = St, A 



t-i 



a t _i 



Due to computational challenges, it is usually not possible to directly compute an 
optimal policy by directly solving the Bellman equation. As an alternative method 
which requires less memory and less co mputation, temporal-difference learn ing; can 



also be used to obtain optimal policies (IKaelbling et al 



1996 



Sutton 



19881 ). In the 



next section, we will introduce a very important off-policy temporal-difference learn- 
ing method, Q-learning, which is a popular approach to estimate dynamic treatment 
regimes. Q-l earning is the estimating approach for which the statistical inference 



procedures in 



Chakraborty et al. 



(120091 ) were proposed. 



2.2 Q-Learning Procedure 

The motivation of Q-learning is that once the Q-functions have been estimated, it 
is only necessary to know the state to determine the best action. From a statistical 
perspective, the optimal time-dependent Q-function is 



Qt(s uS L t )=E r t + 7 KU(Sm,A t 



St — St, At — & t 



Since by definition, V t * +1 (s t+1 , a t ) = max fll+1 Q^ +l (s t+1 , a t , a t+1 ), and hence 
7r^ +1 (st+i, a t ) = argmax <5^ +1 (s t+ i, a t , a t +i), one-step Q-learning thus has the sim- 
ple recursive form: 



Qt(s t ,a.t) = E 



i?t + 7 max Qt+i(St+i, at+i ^ 



St, At 



at 
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According to the recursive form of Q-learning in (JT]) , we must estimate Q t backwards 
through time t = U, U — 1, . . . , 0. To estimate each Q t , we parameterize Qt(st, &t', Ot) 
as a function of the parameter 6 t . After finishing estimation through this backward 
recursive process and obtaining the sequence estimators {Qt}t=o, we can estimate the 
optimal treatment regimes 7r t = argmax at <5t(st, a t ; t ), fo r t = 1 U. 



We will use a simple, two-stage example (also used in IChakraborty et al.l (j2009[ )) 
to illustrate the proposed approaches. Let the Q-function for time t — 1, 2 be modeled 

as 

Q t (S tl A; (3 t: tf> t ) = [3jS tl + (i/}JS t2 )A t , (2) 

where St is the full state information at time t and Sti and are subsets of St 
selected for the model. The action A t takes value 1 or —1. The parameters of 
the Q-function are 6 t = (0[, ift't) T , where (3 t reflects the main effect of current 
state on outcome, while ij) t reflects the interaction effect between current state and 
treatment choice. Let Y t denote the optimal total potential reward at time t. In 
this work, we will assume that Yi = the discount rate 7 = 1 and Y\ — R\ + 
maxagf.! !} <52(S2, a; ^20 > V'20)' where /3 20 and t/? 20 are the true unknown values. The 
observed data consist of (S^, A ti , R ti ) for patients i = 1, . . . , n and t = 1,2, from a 
sample of n patient trajectories. 

The two-stage empirical version of the Q-learning procedure can now be summa- 
rized as follows: 

Step 1. Start with a regular and non-shrinkage estimator, based on least squares, for 
the second stage: 

-1 



2 = (f3 2 , V 2 ) = axgmin^^ P n (y 2 - g 2 (S 2 , A 2 ; (3 2l */> 2 ))' 



Z 2 r Z2 



2 1 2, 



where Z 2 is the stage-2 design matrix and Y 2 = (Y 2 i, ...,Y 2n ) T and ¥ n f(x) = 
V ra H7=i f( x i) i s the empirical measure. 

Step 2. Estimate the first-stage individual pseudo-outcome by Y 1 — (Y^ M , . . . , Y-^ M ) T , 
where 

Y™ = J R li + max o g 2 (S 2 i,a;3 2 ,'5 2 ) = J R 1 i + 3 2 S 21i + |^ 2 S 22 i|. (3) 



7 



Step 3. Estimate the first-stage parameters by least square estimation: 
0* M = aigmmp^Fni?™ - Q^, A i; /3 1 ,^ 1 )) 2 = [zfZij 



1 rp ~ ffM 



where Zi is the stage- 1 design matrix. The correspondin g estimator of , 

^ HM 

denot ed by , is referred to as the hard-max estimator in 



Chakraborty et al. 



( 120091 ). because of the maximizing operation used in the definition. 



2.3 Non-regularity Problem in Statistical Inference 

When the Q-function is taking the linear model form (j2J), the optimal dynamic treat- 
ment regime for patient {i, i — 1, . . . , n} is given by 

di(Sji) = axgmax a .{i/)JS j2 i)a i = sgn(i/>JS i2 ;), for j = 1, 2, 

where sgn(:r) = 1 if x > and —1 otherwise. The parameters t/> 2 are of particular 
interests for estimation and inference of the optimal dynamic treatment regime, as 
confidence intervals for i/> ■ can lead to confidence intervals for di. 

During the Q-learning procedure, when there is a positive probability that i/ , 2o^22 = 
0, the first-stage hard-max pseudo-outcome Y 1 is a non-smooth function of i/j 2 - As 

— HM ^ HM 

a linear function of Y 1 , the hard-max estimator is also a non-smooth function 
of . Consequently, the asymptotic distribution of -y/n(i/> 1 — V'io) * s neither nor- 
mal nor any well-tabulated distributions if P(i/> 20 S22 = 0) > 0. In this non-standard 
case, standard inference methods such as Wald-type confidence intervals are no longer 
valid. 



2.4 Review of Existing Approaches 

To overcome the difficulty of inference of tp 1 due to non-regularity in Q-learning, 
several methods have been proposed and we will briefly review these methods in the 
two-stage set-up. They are referred to as the hard-threshold estimator (also called 
Zeroing Instead of Plugging In (ZIPI) estimator in Mood ie and Richardson (2010) and 



the soft-threshold estimator in 



Chakraborty et al. 



fl200j3). Since all these methods are 



also nested in the Q-learning procedure, we update the two-stage version of Q-learning 
as follows: 



S 



Step 2'. Estimate the first-stage individual pseudo-outcome by shrinking the second- 
stage regular estimators via hard-thresholding or soft-thresholding. Specifically, 
the hard-threshold pseudo-outcome is denoted Y x = (Y-^f T , . . . , Y^ T ) T , with 

-T 

i 22i Zj 2 a 22i 

where £ 2 is the estimated covariance matrix of i\) 2 - 



VHT o ,3 T Q , U ?, T C 1 1 f V^lV^al ^ \ 
>li = «li + P 2 S 2 li + IV2 ^22i| • li , = > Za/2 \, 



ST 



The soft-threshold pseudo-outcome is denoted Y 1 = (Iff, . . . , Yf n ) , with 



= F lj + /3 2 S 2 i i + |V 2 S 



22/ 



A, 



1^2 S 22i| 



i = 1, . . . ,n, 



(5) 



where x + = xl{x > 0} is the positive part of a function and Aj is a tuning 
parameter. 



Step 3'. Estimate the first-stage parameters by least squares estimation: 
el = (Pl^lf = argmin^^P^ - Q^Si, A i; (3 V = 



rri -"^ O 



where Y x = (Kfj, y^) 7 " is the first-stage pseudo-outcome obtained in Step 2'. 
It can be either a hard-threshold or soft-threshold pseudo-outcome, as defined in 
either or dSJ), respectively. The corresponding estimator of denoted 
can be the hard-threshold estim ator (also referred t o as Z eroing Instead 
of Plugging (ZIPI) estimator in 



Moodie and Richardson 



(bold )) or the soft- 



threshold estimator , respectively. 

The hard-thresholding and soft-thresholding methods can be viewed as upgraded 
versions of th e hard-max methods in te rms of reducing the degree of non-regularity. 

(120091 ) commen ted that t 



For example, 



Chakraborty et al. 



the form of the non-negative garrote estimator (jBreimanl . 



re third term in 



takes 



19951 ). through which the 



problematic term |'0 2 S22 1 is expected to shrink (or thresholded) towards zero. Even 
if the degree of non-regularity is somewhat decreased, in general, tp 1 and 
remain non-regular estimators of i/? 10 , and standard inference met hods such as Wald 



Chakraborty et al. 



type c onfidence intervals are still not valid. Numerical studies in 
(120091 ) show an apparent bias for these two methods in certain simulation settings. 

Due to the nature of the non-regularity, the asymptotic distributions of these three 
estimators are not well tabulated, and thus direct inference for is not feasible. 
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Bootstrap methods seem to be the only remedy. IChakraborty et al.l ( 120091 ) applied 
several bootstrap methods to construct confidence interval for i/^. Unfortunately, no 
theoretical support was provided for any of these methods. Moreover, none of these 
bootstrap confidence intervals for hard-thresholding and soft-thresholding methods 
perform uniformly well in all of the simulation scenarios considered. 

In summary, the first-stage pseudo-outcome for these three existing methods can 
be viewed as shrinkage functionals of certain standard estimators (such as least square 
estimator in this two-stage set-up). Although the idea of simultaneously approxi- 
mating the hard-max estimator (or more precisely, the absolute value function) and 
reducing the degree of non-regularity sounds reasonable, it may not be an appro- 
priate approach to use shrinkage formulations of existing regular estimators directly 
as the first-stage pseudo-outcome. The reason is that even if these estimators form 
shrinkage estimators under certain conditions (e.g., least squares regression with the 
only covariates S21 and under an orthonormal design), in general, they are not opti- 
mizers of reasonable objective functions. Consequently, even if these estimators can 
successively achieve the goal of shrinkage, the following two drawbacks remain that 
negate their ability to be used for statistical inference for optimal dynamic treatment 
regimes. 

First, the bias of these "shrinking" first-stage pseudo-outcomes can be large in 



finite samples, leading to further bias in the first 



stage estimation o: 



Chakraborty et al. 



t/^ . T his point 
(12009L Second, 



has been demonstrated in the empirical studies of [ 
and more importantly, these shrinking functional estimators do not appear to possess 
the oracle property. Here we refer to the oracle property as: with probability tending 
to one, the set = {i : |V ? 2o^22i | > 0} can be correctly identified and the resulting 
estimator performs as well as the oracle estimator, which knows in advance the right 
set These two concerns probably are the direct reasons why these three existing 
estimation methods, with their corresponding bootstrapping confidence intervals, do 
not perform well (with bias and invalid coverage length, for example) in some em- 
pirical studies. This suggests that it may not be appropriate to attempt to fix the 
non-regularity by regularizing existing estimators directly. 
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3 Inference Based on Penalized Q-Learning 



In this section, we propose an innovative method, penalized Q-learning (PQ-learning), 
for statistical inference in reinforcement learning problems. Our method is based on 
individual selection via a penalized likelihood; so it will automatically determine those 
individuals whose value functions are not affected by treatment assignments, i.e, those 
individuals for whom ift 2 S 22 given in Section 2.2 is 0. As a result, this oracle property 
ensures that subsequent inference in the Q-learning framework will be the same as 
if we knew which individuals had no treatment effect. Accordingly, the resulting 
inference will no longer suffer from the non-regularity problem discussed above. To 
describe our method, we still focus on the two-stage setting as given in Section 2.2 
and use the same notation. The generalization to the multiple-stage setting is similar 
and we will it discuss briefly at the end of the section. 



3.1 Estimation Procedure 



As a backward recursive reinforcement learning procedure, our method follows es- 
sentially the same three steps as the usual Q-learning method. The only difference 
is that our approach replaces Step 1 of the standard Q-learning procedure with the 
following: 

Step 1'. Instead of minimizing the summed squared differences between Y 2 and ^2(82, A 2 ; (3 2 , if> 2 ) 
we minimize the following penalized objective function: 



W 2 (0 2 ) = - Q2(S 2u A 2l - (3 2 , V 2 )) 2 + XXfl^Saal), 



(6) 



i=l 



i=l 



where P\ n (-) is a pre-specified penalty function and A n is a tuning parameter. 

Because of this penalized estimation, we call our approach "penalized Q-learning" 
and abbreviate it as PQ-learning. Since the penalty is put on each individual, we also 
call Step 1' "individual selection." 

Using penalty functions in PQ-learning, i.e., individual selection, enjoys simi- 
lar "shrinkage" advantages as penalized methods described i n the recent variable 



selecti on literature. To name a few, t he bridge regression in 



( 11993 



ties in 



, the LASS O in 



Frank and Friedman 



ibshiranil f 19961 ). the SCAD and other fo 



Fan and Lil (120011 ) . the Dantzig selector in lCandes and Tad (120071 ) . the adaptive 



ded-c oncave penal- 



11 



LASSO in IZoul (120061 ) and one-step estimator in IZou and Lil (120081 ). In variable se- 
lection problems where the selection of interest consists of the important variables 
(mostly covariates) with nonzero coefficients, using appropriate penalties can shrink 
the small estimated coefficients to zero to enable the desired selection. In the individ- 
ual selection done in the first step of the proposed PQ-learning approach, penalized 
estimation allows us to simultaneously estimate the second-stage parameters 62 and 
select individuals whose value functions are not affected by treatments, i.e., those 
individuals whose true values of ^2^22 are zero. 

The above fact is extremely useful in making correct inference in the subsequent 
steps of the PQ-learning procedure. To understand why, we recall that the non- 
regularity problem in the usual Q-learning procedure is mainly caused by difficulties 
in obtaining the correct asymptotic distribution of 



>/n(ft S 22| - |-0lo S 22 |), 

where i/j 2 o * s the true value of i/v Via our PQ-learning method, we can identify 
individuals whose ^2^22 = ^20^22 takes value zero; moreover, we know that for these 
individuals tp 2 ^22 has the same sign as t/> 2 o°22- I n this case, the above expression is 
equivalent to 

y/n$ 2 ~ V'2o) Ts 22sign('0o S 22 )- 

Hence, correct inference can be obtained following standard arguments. More rigorous 
details will be given in our subsequent asymptotic theorems and proofs. 

The choice of the penalty function p\ n (-) can be taken to be the same as used 
many popular variable selection methods. Specifically, we require p\ n (-) to possess 
the following properties: 



Al. For non-zero fixed 9, lim Ti 
and W(|0|)} 0. 

A2. For any M > 0, lim^ooinf 



n 1/2 Px n (\0\) 



0, lim,; 



n 1/3 KOT = o, 



\6\<Mn- 



Among many penalt y functions satisfy ing Al and A2, some common ch oices i nclud e 
the SCAD penalty (IFan and Lil . 1200 ll ) and the adaptive lasso penalty (IZoul . 120061 ) . 
where p\ n {9) = \ n 9/\9^\ a with a > and # (0) being a root-n consistent estimator of 
9. To achieve both sparsity and oracle properties, the tuning parameter A n in these 
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examples should be taken correspondingly. For illustration, Figure 1 plots these 
penalty functions. The adaptive lasso method will be implemented in this paper, 
where A n can be taking as scalars satisfying y/n\ n — > and n\ n — > oo. 



(a) Hard (to) Weighted L I (c) SCAD 




Figure 1: Plot of thresholding functions with A=2 for (a) the Hard Threshold; 
(b)adaptive lasso with a = 3; (c) SCAD, with the hard-max function as a reference 
in each panel. 

3.2 Implementation 

The minimization in Step 1' of the PQ-learning procedure has some unique features 
which distinguish it from the optimization done in the variable selection literature. 
First, the component to be shrunk, i$2$22i-, is subject-specific; second, this component 
is a hyperplane in the parameter space, i.e, a linear combination of the parameters. 

To deal with these issues, in this section, we propose an algorithm for the minimiz- 
ing problem of (EJ) based on local quadratic approximation (LQA). Following Fan and 
Li (2001), we first calculate an initial estimator if> 2 (o)- We then obtain the following 
LQA to the penalty terms in ([6]): 

i\ I Tq h /,7 T c h , lK„(l^2(0) S 22i|) T 2 qT % 

P\ n (m S 22i|) ~ PA n (|V'2(0) S 22i|) + g ((^2 S 22i) ~ (^2(0) S 22i) ) 

1^2(0)^2211 
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for -02 close to ^/> 2 (ov Thus, (jSJ) can be locally approximated up to a constant by 



X>< - g 2 (s 2i , a* ; /3 2 , </> 2 )) 2 + i £ ^#^(^s 224 ) 2 . (7) 

i=l i=l |V , 2(0)S22i| 

The updated estimators for ?/> 2 and /3 2 can be obtained by minimizing the above 
approximation. Under the special case where Q(-) is given by (2), this minimization 
problem has a closed form solution: 

■02 = [^22(1 — X 2 i(X 21 X 2 i) 1 X 2 r 1 + D)X 2 2] X 22 (I — X 2 i(X 2 r L X 2 i) 1 X 21 )Y 2 , 

3 2 = (X 2 r 1 X 21 )- 1 Xj 1 (Y 2 - X 22 ^ 2 ), 

where X 22 is a matrix with z'-th row equal to S^A^, X 21 is a matrix with i-th row 
equal to S 21i , I is the n x n identity matrix and D is an n x n diagonal matrix with 

D « = 2^A n ( I ^2(0) S 22i I ) / 1 -02(0) S 22i | ■ 

The above minimization procedure can be continued for more than one step or 
until convergence. However, as discussed in Fan and Li (2001), either the one-step 
or fc-step estimator will be as efficient as the fully iterative method as long as the 
initial estimators are good enough. A well known limitation of the LQA algorithm 
is that although it can shrink \ip 2 ^22i\ to a very small value if the true value is zero, 
it cannot set it exactly to zero. Therefore, in practice, we will set \if^ 2 ^22i\ — once 
the value is below a pre-specified tolerance threshold. 

The choice of LQA is mainly for convenience in solving the penalized least squares 
estimation in ([6]). If least absolute deviation estimation or some other quantile re- 
gressions is used in place of l east squares , then the local linear approximation of the 



penalty function described in lZou and Lil ( 120081 1 can be used instead of LQA, and the 



resulted minimization problem can be solved by linear programming. The linear pro- 
gramming approach can shrink small values of |i/> 2 S 22 j| exactly to zero and therefore 
avoids the additional th resholding done in LQA. Alternatively, the Dantzig selector 
f lCandes and Tad . 120071 ) can be used with penalized least squares estimation in ()6]), 
but the asymptotic properties for this setting are beyond the scope of this paper, 
although they are currently being investigated by the authors. 
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3.3 Asymptotic Results 

In this section, we establish the asymptotic properties for the parameter estimators in 
our PQ-learning method, assuming that the support of S22 contains a finite number of 
vectors, say, T 1; T K . Moreover, we assume tp^o^k 7^ for k < K\ and ij)^ Q T k = 
for k > K 1 . Let 

n fc = #|{z: S 22 i = T fe , % = l,...,n}|, 

where for a set A, #\A\ is defined as its cardinality. 

Additionally, we assume that the penalty function p\ n (x) satisfies Al and A2 and 
that the following conditions hold: 

Bl. The true value for 2 , denoted by 6 2 o = (V'2o> 02o) T 1 minimizes 

limP n [F 2 -Q 2 (S 2 ,A 2 ;/3 2 ,^ 2 )] 2 ; 

n 

while, the true value for 0\, denoted by 6 W = (ipj , 0iq) t , minimizes 

1 2 

i?x + maxQ 2 (S 2 ,a;/3 20 , V 20 ) - Qi(Si, A] /3 X , Vi) • 

a 

In both expressions and in the following, we always assume that the limits exist. 

B2. For k = 1,2, with probability one, Qk{Sk, A^; Oj,) is twice-continuously differen- 
tiable with respect to 6}, in a neighborhood of 6^ and moreover, the eigenvalues 
of the Hessian matrix, 1^ = lim n P„[V^ q (Yk — Qk(Sk, A^, ^fc)) 2 ], are positive 
and bounded away from zero at Ok = Oko- 

B3. With probability one, nk/n = pk + O p (n -1 / 2 ) for some constant pk in [0, 1]. 

Condition Bl basically says that dio and O20 are the target values in the dynamic 
treatment regimes, which we consider to be the true values. Condition B2 can be 
verified via the design matrix in the two-stage setting. Specifically, if Q t takes the form 
of (2), this condition is equivalent to non- singularity of the design matrix [A t , S t A t }. 

Under these conditions, our first theorem shows that in Step 1' of the PQ-learning 
procedure, there exists a consistent estimator for # 2 : 

Theorem 1. Under conditions A1-A2 and B1-B3, there exists a local minimizer 
02 ofW 2 (0 2 ) such that ||0 2 -0 2 o|| = P (n~ l / 2 +a n ) , wherea n = maxf^ {p' Xn {\^w T k\)} ■ 



limE 

n 
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According to the properties of P\ n {-), we immediately conclude that 6 2 is y/n- 
consistent. From Theorem 1, we further obtain the following result, which verifies 
the oracle property of the penalized method: 

Theorem 2. Recall the set 

Ml = {i: ^ Q S 2 2i = 0} . 

Then under conditions A1-A2 and B1-B3, 

lim P(V>2 $22i = 0, for any i G Ml) = 1. 

n— >oo 

The set Ai^ consists of those individuals whose true value functions at the first 
stage have no effect from treatment. Thus Theorem 2 states that with probability 
tending to one, we can identify these individuals in AA^ using empirical observations. 
As discussed before, this result will be very useful in addressing the non-regularity 
problem for subsequent inference. Additionally, we will also need the asymptotic dis- 
tribution of 6 2 in order to make inference. This is provided in the following theorem: 

Theorem 3. Under conditions A1-A2 and B1-B3, 

v^(/ 20 + £){£ 2 - 2O + (ho + £) _1 b} -> N{0, J 20 }, (8) 

where 

b = (0l,Y,PkP'x n (\^k\)sgn^lT k )T K ^ , 

k=l 

and £ = diag{0 pxp ,J2PkP'L(\^ T k \)TkTl}. 

k=l 

Using the results from Theorems 1-3, we are able to establish asymptotic normal- 
ity of the first stage estimator 0\\ 

Theorem 4. Under conditions A1-A2 and B1-B3, let S 2 = (S^, S^sgn^ao^)) 1 '- 
Then 

V^(?i - Oio) I^o {Gi + limPnZiS^Ga}, (9) 

where G\ ~ N 



0,lim n co^V e>i Q 1 (S 1 ,A 1 ;6» 10 )(F 1 -g 1 (S 1 ,A 1 ;6> 10 )) 



Go- N 



0, (J 20 + S)- 1 / 20 (/2o 



and cov represent the sample variance. 
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3.4 Variance Estimation 

The standard errors for the estimated parameters can be obtained directly since we 
are estimating parameters and selecting individuals simultaneously. A sandwich type 
plug-in estimator can be used as the variance estimator for 6 2 : 

cot(0 2 ) = (/ 20 + E^/aoCJao + S)" 1 , 

where I 2 o = P n [V^ q (Y 2 — <52(S 2 , A 2 ; O2)) 2 ] is the empirical Hessian matrix and 
E = diag{Op X p, P n p^(|i/ J 2 S^DS^S^}. As £ is often negligible, we use 

cov(0 2 ) = J" 1 (10) 

instead, and this performs well in practice. The estimated variance for 6\ is then 
cov(fli) = 



To 1 



cov { V^QiCSx, A i; O^Y, - Q^Sx, Ai, 9,))} + P^S, c6v(0 2 )S 2 Zf 



where I w = P n [V^ q (Y\ — <2i(Si, Ai; Oi)) 2 ] is the empirical Hessian matrix. These 
variance estimators will be shown in simulations presented later to have good ac- 
curacy for moderate sample sizes. This success of direct inference for the estimated 
parameters makes statistical inference for optimal dynamic treatment regime possible 
in the multi-stage setting. 

3.5 Generalization to the Multi-stage Setting 

In this section, we will extend the inference procedure from the two-stage to the 
more general multi-stage setting. The PQ-learning procedure for finding the optimal 
dynamic treatment regime for a general [/-stage setting can be summarized in 2U + 1 
steps as follows: 

Step 1. Start from the Uth stage by minimizing the penalized Q-function at the Uth- 

stage: 

n n 

Wu(0u) = J2(Ym ~ Qu(Sm, A m - (3 V , ^ v )f + J^d^S^I), 

8=1 1 = 1 

where P\ n {') is a pre-specified penalty function and A n is a tuning parameter. 
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Step 2. Estimate the (U— l)th-stage individual pseudo-outcome by Yjy_i = (Yu-i,ii ■ ■ ■ > ^c/-i,n) T , 
where 



^cr— i,a — Ru-x,i + maxQ(7(St7 > j, a; Ou-i) — Ru-i,i + Pu^uu + iV'cjSt/^ii- 

a 

Step 3. Minimize the penalized Q-function in the (U — l)th-stage with the individual 
pseudo-outcome obtained from Step 2: 



n 



Wu-i(Ou-i) = ^(Yu-ij ~ Qu-i(Su-i,i, Au-x/, ^-i)) 2 + ^P^u-i^u-iM)- 
i=i i=i 

Step 4. Estimate the (U— 2)th-stage individual pseudo-outcome by Yjj_ 2 = (Y[/-2,i> • • • , ^/-2,n 
where 

Yy_2,j = Ru-2,i + maxQc/.i^Sfz-ii, a; #[7-2) = Ru-2,i + /3c/-iSc/-i,ii + \fpu-i^u-i,2i\- 



iT 



Step 2U+1. Estimate the first-stage parameters by least squares estimation: 

0! = argmin /3i ?/ , i P re (y 1 - Q 1 (S 1 , A x] /3 l5 Vi)) 2 = [zf Z x l ZfYx. 



- 1 T ' 



Theorem 4 can be used backwards recursively to obtain the asymptotic distribution 
of the parameters at each stage since the oracle properties can be inherited from 
the prior iteration. The plug-in variance formula ffTTT) can also be used backwards 
recursively for inference for the parameters in each stage. 

4 Simulation Studies 



Chakraborty et al. 



(120091 ) designed a thorough simulation study of two-stage Q- 
learning, which covers regular, non-regular and close-to-non-regular conditions. In 
this section, we apply the proposed method to the same simulation study conditions. 
Specifically, a total of n = 300 subjects are contained in the data. We set R± — and 
(Oi, Ai, O2, A2, R2) is collected on each subject. The binary covariates O f 's and the 
binary treatments At 8 are generated as follows: 



P(Oi = 1) = P(Oi = -1) = 1/2, 
18 



P(A t = 1) = P(A t = -1) = 1/2, t = 1,2, 
P(0 2 = 1\0 1 ,A 1 ) = l-P(0 2 = -l\0 1 ,A 1 ) = expit(6 1 1 + 6 2 A 1 ), 
where expit(x) = exp(x)/(l + exp(x)). 

R2 = 7i + 72O1 + 73A1 + 74O1A1 + 75A2 + 76O2A2 + 'y 7 A 1 A 2 + 5, 

where e ~ A^O, 1). The Q-functions for time t = 1,2 are both correctly specified as: 

Q 2 (0 1 , A x , 2 , A 2 - f3 2 , V> 2 ) =(3 21 + (3 22 0x + (3 23 Ax + 24 OxAx 

+ ^21^2 + ^22^2^2 + tp 23 AxA 2 , (12) 

Qx(O u Ax; Px, V'i) =Pn + faO x + ^ 11 A 1 + i'MA^ (13) 



Chakraborty et al 



(120091 ) . the true values of tf) 1 = (ipxxo, ^i2o) T are given 



As shown in 
by: 

^110 = 73 + qi\fi\ ~ g 2 |/ 2 | + (1/2 - gi)|/ 3 | - (1/2 - g 2 )|/ 4 |, 

^X20 = l4 + q'x\fx\-q' 2 \f2\-q'x\f3\ + q 2 \h\, 

where q\ = 1 / 4(expit(Sx + S 2 ) + expit(— 5\ +S 2 )), q 2 = l/4(expit(5i—5 2 ) + expit(—5i — 

£2)), q[ = l/4:(expit(Sx + 5 2 ) — expit(—5x + 5 2 )), q 2 = 1 / 'A(expit(8x — 5 2 ) — expit(—5\ — 

fa)), fi = 75 + 76 + 77, f2 = 75 + 76 - 77, h = 75 - 76 + 77, A = 75 - 76 - 77- Let 
7 = (71, ...,7 7 ) T . We consider the following six settings: 

(0,0,0,0,0,0,0) T A = 5 2 = 0.5. 

(0, 0, 0, 0, 0.01, 0, 0) T , Sx = S 2 = 0.5. 

(0, 0, -0.5, 0, 0.5, 0, 0.5) T , 6x = 5 2 = 0.5. 

(0, 0, -0.5, 0, 0.5, 0, 0.49) T , 5x = 5 2 = 0.5. 

(0, 0, -0.5, 0, 1, 0.5, 0.5) T , Sx = 1,S 2 = 0. 



Setting 1: 7 
Setting 2: 7 
Setting 3: 7 
Setting 4: 7 
Setting 5: 7 
Setting 6: 7 



0, 0, -0.5, 0, 0.25, 0.5, 0.5) T , Sx = 5 2 = 0.1 



The values of ^20^22 for each setting are listed in Table HI 

We applied the proposed penalized Q-learning with the adaptive lasso penalized 
regression to the six settings. The one-step LQA algorithm is used with least squares 
estimation for the initial values. The tuning parameter A in the adaptive lasso penalty 
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Table 1: Values of ^20^22 in the simulation studies. 







S22 — 


(1,0 2 ,A) 




Setting 


(1,1,1) 


(1,1,-1) 


(1,-1,1) 


(Irlrl) 


1 














2 


0.01 


0.01 


0.01 


0.01 


3 


1 





1 





4 


0.99 


0.01 


0.99 


0.01 


5 


2 


1 


1 





6 


1.25 


0.25 


0.25 


-0.75 



is chosen by five fold cross-validation. When the estimated value | V , 2 1 < 0.001, it 
will be set as zero in the stage-1 estimation. The simulation results shown in Table 
|4] were summarized over 1000 replications. We included both the oracle estimator 
and the hard-max estimator for comparison. Theoretical standard errors and 95% 
confidence intervals for the hard-max estimator are not available. 

Setting 1 is a completely non-regular setting, where V'20^22 = for all values of 
S22- The oracle estimator automatically sets V2 = an< l i s therefore regular. It 
has a very small bias, with standard errors accurately predicted by the theory and 
95% confidence interval coverage close to the nominal value. The PQ estimator's 
performance is very close to the oracle estimator, with similar bias, a slightly bigger 
but still well estimated standard error and similar confidence intervals. 

Setting 2 is regular but very close to setting 1 with V20S22 all equal to 0.01. 
The oracle estimator reduces to the hard-max. Although its bias is small, the oracle 
estimator's theoretical standard error is significantly overestimated and, as a result, 
its confidence intervals show significant over-coverage for ipno- On the other hand, 
the PQ estimator remains consistent and its standard error estimation remains close 
to the empirical values. 

Setting 3 is another non-regular setting. The value of i/?^ S 2 2 is equal to 1 for half 
of the subjects and equal to for the other half. As expected, the oracle estimator 
has the best performance in the sense of having smallest bias and standard error as 
well as being precisely predicted by the theoretical standard error. The PQ estimator 
has a bigger bias and standard error than the oracle estimator but the theoretical 
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standard error remains close to the empirical values. 

Setting 4 is a regular setting but close to Setting 3. The PQ estimator outperforms 
the oracle estimator, with both a smaller bias and a smaller standard error. This phe- 
nomena is consistent with findings in Setting 2, which is another nearly non-regular 
setting. However, in Setting 2, we did not observe substantial overestimation of the 
standard error nor over-coverage of the confidence intervals in the oracle estimator. 

Setting 5 is a non-regular setting similar to Setting 3 and the performances of the 
oracle and PQ estimators are very similar to their performances in Setting 3. 

Setting 6 is a completely regular setting with values of ■020^22 well above zero. 
The PQ estimator has a slightly bigger bias and a slightly bigger standard error than 
the oracle estimator. Both estimators's standard errors are well predicted from the 
theory. 

In summary, the behavior of the PQ estimator, including its bias, theoretical 
computed standard error and coverage probability of theoretically computed 95% 
confidence intervals, are consistent in all six settings, whether regular or non-regular. 
In non-regular or completely regular settings, PQ-learning usually has bigger bias and 
larger standard error than the oracle estimator. In regular settings which are close to 
no n-regular cases, it has sma ller bias and standard error than the oracle estimator. 



Chakraborty et al. 



( 120091 ) proposed several bootstrapped confidence intervals (PB: 
percentile bootstrap, HB: hybrid bootstrap, DB: double bootstrap) for the hard-max 
estimator as well as hard-thresholding estimators with threshold at 0.08 (HTo.os) and 
0.20 (HT .2o) an d soft-thresholding estimator (ST). We include their simulation results 
(V'no only) in Table H] for comparison. The coverage probabilities of all 10 inferential 
methods in the six settings are plotted in Figure 2. The boxplot of oracle estimator, 
PQ-estimator and hard-max estimator of ipu and ip\2 from 1000 estimates of these six 
settings are provided in Figures 3 and 4 respectively. The PQ-estimator has smaller 
bias and narrower inter-quantile range compared with the hard-max estimator in 
most of the settings. The bias and Monte-Carlo standard deviation of the hard-max 
estimator presented in Table 2 and Table 3 are similar, validating a direct comparison 
between the two study results. Readers are directed to the original articles for a 
discussion on the performances of different estimators and bootstrapping methods. 

Briefly speaking, the bias of the hard-max estimator in Settings 3 and 4 is relatively 
big. The percentile bootstrapped and hybrid bootstrapped confidence intervals can 
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Table 2: Summary statistics and empirical coverage probability of 95% nominal per- 
centile CIs for ipuQ and ^120 using the oracle estimator, the proposed PQ-learning 
based (PQ) estimator and the hard max (HM) estimator. Specifically, "Std-MC" 
refers to the standard deviation of 1000 estimates for ipn or ^120, "Std" refers to 
the average of the 1000 standard error estimates and "CP" refers to the empirical 
coverage probability of 95% nominal percentile confidence interval. A "*" indicates 
a significantly different coverage rate other than the nominal rate. 



Setting 




Bias 


^110 
Std-MC 




Std 


CP 


Bias 


^120 
Std-MC 




Std 


CP 


1 


oracle 


-0.0015 


0.058 





058 


94.6 


-0.0004 


0.060 





058 


94.3 




PQ 


-0.0013 


0.060 





061 


95.1 


-0.0009 


0.060 





058 


94.0 




HM 


-0.0005 


0.066 








-0.0023 


0.060 








2 


oracle 


-0.0025 


0.065 





075 


97.5* 


0.0003 


0.056 





059 


96.3 




PQ 


-0.0026 


0.059 





060 


94.0 


0.0013 


0.056 





058 


96.3 




HM 


-0.0025 


0.065 








0.0003 


0.056 








3 


oracle 


-0.0032 


0.071 





071 


95.3 


-0.0043 


0.057 





058 


94.9 




PQ 


-0.0182 


0.073 





076 


95.2 


-0.0049 


0.057 





058 


95.4 




HM 


-0.0437 


0.075 








-0.0051 


0.058 








4 


oracle 


-0.0330 


0.076 





079 


94.9 


-0.0021 


0.058 





059 


94.9 




PQ 


-0.0073 


0.074 





075 


95.5 


-0.0022 


0.058 





058 


94.8 




HM 


-0.0330 


0.076 








-0.0021 


0.058 








5 


oracle 


-0.0002 


0.075 





079 


95.7 


0.0005 


0.067 





063 


93.9 




PQ 


-0.0188 


0.079 





080 


95.3 


0.0056 


0.067 





064 


94.0 




HM 


-0.0204 


0.077 








0.0092 


0.067 








6 


oracle 


0.0003 


0.078 





080 


95.4 


0.0009 


0.063 





062 


94.8 




PQ 


-0.0012 


0.079 





080 


95.3 


0.0009 


0.063 





062 


94.7 




HM 


0.0003 


0.078 








0.0009 


0.063 
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correct the coverage rates for hard-thresholding and soft-thresholding estimators in 
some settings. However, neither of these two bootstrapped methods can consistently 
improve the coverage rate for all estimators. Overall, the soft-thresholding estimator 
has the best performance with the percentile bootstrapped confidence intervals. This 
is consistent with our findings for the PQ estimator due to their similar nature. 
Nonetheless, we derived the theoretical formula for standard errors and therefore 
did not need to rely on bootstrap method. Thus PQ-learning is substantially more 
computationally efficient than the soft-thresholding approach with the bootstrap and 
performs at least as well. 

5 Analysis of STAR*D study 

We here present the analysis of STAR*D study data using PQ-learning. STAR*D is 
a prospective multi-site randomized clinical trial designed to determine the compar- 
ative effectiveness of different multi-level treatment options for patients with major 
depressive disorder (MDD). A total of 4041 patients (ages 18-75) with nonpsychotic 
MDD were enrolled and initially treated with citalopram (CIT) (Level 1 treatment) 
for a minimum of 8 weeks, with strong encouragement to complete 12 weeks in order 
to maximize benefit. During this and all subsequent treatment levels, patients would 
have clinic visits at weeks 0, 2, 4, 6, 9 and 12. 

During all clinic visits, symptomatic status would be measured by the 16-item 
Quick Inventory of Depressive Symptomatology V Clinician- Rated (QIDS-Cie). Pa- 
tients who did not have a satisfactory response to treatments, defined as either 
< 50% reduction in QIDS-Ci 6 or QIDS-Ci 6 > 5, would be elligible for Level 2 treat- 
ment. Seven treatment options are available at Level 2, which can be classified into 
two classes, (1) Medication or Psychotherapy Switch: sertraline (SER), venlafax- 
ine (VEN), bupropion (BUP) or Cognitive Psychotherapy (CT); and (2) Medication 
or Psychotherapy Augmentation: CIT+BUP, CIT+buspirone (BUS) or CIT+CT. 
Patients who were assigned to CT or CIT+CT in Level 2 and did not have a satis- 
factory response would be elligible for Level 2A, during which they would be treated 
with either VEN or BUP. Patients who did not respond satisfactorily at Level 2 and 
Level 2 A, if applicable, would continue to Level 3 treatment. Level 3 includes four 
options: Medical Switch to mirtazapine (MIRT) or nortriptyline (NTP), and Med- 
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Table 3: Summary statistics and empirical coverage probability of 95% nominal per- 
centile CIs for ^xio and ^120 using the hard max (HM) estimator, the hard threshold 
estimator with a = 0.08 (HT , 08 ) and a = 0.20 (HTn.m), and the soft -threshold es- 
timator quoting the simulation results from IChakraborty et al.l ( 120091 ). Specifically, 
"Var" denotes the sample variance of 1000 estimates for ipno or ^120- "PB", "HB" 
and "DB" denote percentile bootstrap, hybrid bootstrap and double bootstrap, re- 
spectively. "CP" and "*" have the same meaning as given in Table 2. 



Setting 


Estimator 


Bias 




Var 


PB 


CP 
HB 


DB 


1 


HM 


0.0003 





0045 


96.8* 


93.5* 


93.6 




HTo.os 


0.0017 





0044 


97.0* 


95.0 


— 




HT .20 


0.0002 





0050 


97.4* 


92.8* 


— 




ST 


0.0009 





0036 


95.3 


96.1 


— 


2 


HM 


0.0003 





0045 


96.7* 


93.4* 


93.6 




HT .o8 


0.0010 





0044 


97.1* 


95.3 


— 




HT0.20 


0.0003 





0050 


97.3* 


93.5* 


— 




ST 


0.0008 





0036 


95.4 


95.9 


— 


3 


HM 


0.0401 





0059 


88.4* 


92.7* 


94.8 




HT .o8 


0.0083 





0058 


94.3 


94.3 


— 




HT0.20 


0.0179 





0062 


93.5* 


93.5* 


— 




ST 


0.0185 





0055 


93.4* 


94.9 


— 


4 


HM 


0.0353 





0059 


89.6* 


93.1* 


94.4 




HT .o8 


0.0037 





0058 


94.6 


94.1 


— 




HT .20 


0.0130 





0062 


93.9 


92.8* 






ST 


0.0138 





0055 


94.1 


95.0 




5 


HM 


0.0209 





0069 


92.7* 


93.1* 


94.2 




HT .o8 


0.0059 





0070 


93.9 


93.2* 






HT .20 


0.0101 





0072 


93.3* 


93.0* 






ST 


0.0065 





0069 


93.8 


94.6 




6 


HM 


0.0009 





0067 


95.0 


93.8 


95.0 




HT .o8 


0.0003 





0081 


95.1 


88.5* 






HT0.20 


0.0011 





0074 


94.8 


91.2* 






ST 


0.0052 





0074 


94.8 


91.7* 
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Simulation setting 



Figure 2: Plot of coverage probabilities with all ten inference methods in six settings, 
where the shaded area is 95% confidence for monte-carlo error: [93.7%, 96.3%]. 
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ical augmentation with either lithium or thyroid hormone added to level 2 or 2A 
treatments. Patients who did not respond satisfactorily to Level 3 treatments would 
continue to Level 4 treatments, which include two options: swi tch to trany l cypro mine 
or MIRT+VEN. For a complete description of STAR*D, see Fava et all (120031 ) and 



Rush et al. 



(12004) 



In this analysis, for demonstration purpose, we consider a subgroup of STAR*D 
patients, the 112 patients who were randomized to either BUP or SER in Level 2, had 
no satisfactory response at the end of Level 2, and were then randomized to either 
MIRT or NTP in Level 3. The analysis focuses on selecting the optimized treatment 
regime at Level 2 and Level 3, out of the 4 unique treatment combinations. Since the 
higher QIDS-C16 is? the more severe the symptom is, we define the rewa rd as negative 
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of QID S-Cifi collected at the end of the Level 3. Similarly as discussed in 
( 120071 ). the state variable used to tailor individual treatment is the changing rate of 
QIDS-C16 during the previous treatment level. We dichotomize the changing rates at 
zero. Two patients were further removed due to missing values in the reward or the 
tailor variables. 

Following the notations in the simulation study, let 0± and 2 be the indicator 
of whether the QIDS-C16 changing rate is greater than zero in Level 1 and Level 2 
respectively. Let A\ = 1 if Level 2 treatment is SER and Ai = —1 if it is BUP. Let 
A 2 = 1 if Level 3 treatment is NTP and A 2 = -1 if it is MIRT, R 2 = -QIDS - C ie 
collected at the end of Level 3. The Level 3 regression model is: 

R2 = Au + /32201 + /3 23 ^l + f^2i0 2 + ^21^2 + ^22^2^1 + ^23^2^1 + ^24^2^2 + S 2 . 

Since the main effects of Ai and 2 are not statistically significant, we did not include 
additional interaction terms in Level 3 model. 

Table 4 shows the Level 3 regression model coefficients estimation using both 
unpenalized standard least square estimation and individual penalized least square 
estimation. Qualitatively, unpenalized and penalized estimations are consistent. Pa- 
tients whose symptom worsened (i.e., 0\ — 1 or QIDS-C16 increased) during Level 1 
would have worse reward. Level 2 treatments (SER versus BUP) as well as QIDS-C16 
changing rate during Level 2 show no differential effect on the final outcome. However, 
the two Level 3 treatment options show significant different effects on patients with 
0\ — 1 versus patients with 0\ = — 1. Among patients whose symptom worsened 
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Table 4: Level 3 regression model coefficients estimation using both unpenalized least 
square estimation and individual penalized least square estimation. 



Variable 


unpenalized 


penalized 




Coefficient 


95% CI 


Coefficient 


95% CI 


Intercept 


-13.165 


(-14.349, -11.981) 


-13.185 


(-14.330, -12.039) 


Oi 


-1.202 


(-2.348, -0.057) 


-1.124 


(-2.233, -0.015) 


A 1 


0.004 


(-0.945, 0.954) 


-0.046 


(-0.967, 0.874) 


o 2 


-0.587 


(-1.605, 0.431) 


-0.554 


(-1.533, 0.425) 


A 2 


-1.276 


(-2.460, -0.092) 


-1.266 


(-2.239, -0.292) 


O x A 2 


-1.621 


(-2.766, -0.475) 


-1.300 


(-2.410, -0.191) 


A X A 2 


0.535 


(-0.414, 1.484) 


0.052 


(-0.775, 0.880) 


2 A 2 


0.278 


(-0.740, 1.297) 


0.017 


(-0.748, 0.783) 



in Level 1, NTP further worsened their symptom when compared to MIRT. Among 
patients whose symptom improved in Level 1, NTP and MIRT show no obvious dif- 
ference as to the final outcome. 

Quantitatively, the penalized estimator has smaller standard errors in the coef- 
ficient estimation of ij) 2 = (^2i> ^22, ^23, V ; 24) T than the unpenalized estimator. In 
addition, the penalized estimator dramatically shrinks coefficients of the two unim- 
portant predictors AiA 2 and 02^2 toward zero. On the other hand, these two es- 
timators are similar in the coefficient estimation of /3 2 = {j3 2 i, f3 22 , /?23, P 2 4) T , which 
is expected since the penalty is imposed only on i/? 2 . In order to shrink coefficients 
of the unimportant predictors A\ and 2 , one can further impose penalty on |/3 2 |, 
which will not be implemented in this work. The lack of effect of A\ and 2 is 
actually expected since we include in this analysis only patients eligible for Level 3 
treatment, in other words, only patients who did not respond satisfactorily to Level 2 
treatment. This inclusion criteria is imposed because our current framework is built 
on the situation where all patients will be treated in both stages. The extension to 
cases where patients may be cured during intermediate stages and hence not eligible 
for subsequent treatment stages is not trivial and will be considered in future work. 

Table[5]shows the estimated values of (S^V^I; where S22 = (1, 0±, Ax, 2 ) T . When 
0\ = —1, Level 3 treatment effect is small but the unpenalized estimator shows 
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Table 5: Values of |S^ 2 | in STAR*D study. 



01 


Al 


02 


Unpenalized 


2 

Penalized 


-1 


-1 


-1 


0.468 


0.035 


-1 


-1 


1 


0.088 


0.000 


-1 


1 


-1 


0.601 


0.070 


-1 


1 


1 


1.158 


0.105 


1 


-1 


1 


3.153 


2.601 


1 


1 


-1 


2.640 


2.531 


1 


-1 


-1 


3.710 


2.636 


1 


1 


1 


2.083 


2.496 



Table 6: Level 2 regression model coefficients estimation using both Hard-max and 
PQ-learning. 





Hard-Max 


PQ-learning 


Variable 


Coefficient 


Hybrid 95% CI 


Coefficient 95% CI 


Intercept 


-11.063 


(-12.482, -10.095) 


-11.612 (-13.076, -10.149) 


Oi 


0.263 


(-0.764, 1.547) 


0.313 (-1.114, 1.740) 


A l 


-0.119 


(-1.120, 0.884) 


-0.038 (-1.115, 1.039) 


1 A 1 


-0.448 


(-1.079, 0.251) 


-0.085 (-0.830, 0.661) 



significant bias from zero. On the other hand, the penalized estimator successfully 
shrinks the value of IS^V^I i n an groups close to zero. Although due to the limitation 
of current LQA algorithm, the penalized estimator cannot exactly set |S^ 2 i/} 2 | to zero, 
the bias is significantly smaller than unpenalized estimator. When (0\, A\,02) = 
(— 1, — 1, 1), the penalized estimation of IS^i/^l falls below the preselected cutoff of 
0.001 and is shown as in Table [5j When 0\ = 1, the treatment option MIRT can 
significantly improve the symptom. Since A\ and O2 have no important effect on the 
outcome, we expect similar treatment effects among the four groups with 0\ — 1. 
From this point of view, the unpenalized estimator is inferior since it shows much 
bigger variation than the penalized estimator. 

We next consider the Level 2 regression model. The pseudo-outcome Y is defined 
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byF = /3^S 21 + |V^S 22 | and we impose the following Level 2 model 

Y = p n + + 13 A X + 0uO x A x . 

Table 6 shows the Level 2 model coefficient estimation using both Hard-max and 
PQ-learning. The coeffiecients estimations for the intercept and 0\ are similar from 
two different estimation methods. While in the estimation of cofficients for Ai and 
OiAi, PQ-learning's estimation is significantly towards zero. Based on 95% CI from 
PQ-learning, 0±, A\ have no effect on the pseudo-outcome. Since A\ shows no effect 
in Level 3 regression either, it is easy to interpret its lack of effect on pseudo-outcome. 
In contrast, 0\ is a strong predictor in Level 3 treatment. Its lack of effect in Level 
2 regression may be explained as follows. In Level 3 regression, the 0\ = 1 group's 
reward is smaller than the 0\ = — 1 group's reward by 2/3 22 ~ 2.24. However, the 
optimal Level 3 treatment can increase the 0\ — 1 group's reward by |V' 2 r S 22 | ~ 2.6 
but cannot increase the 0\ = — l's reward. Combined together, 0\ has no effect on 
the pseudo-outcome. 

Our analysis found that the optimal Level 2 and Level 3 treatment regime is 
following. If a patient's symptom worses in Level 1 and remains unsatisfactory in 
Level 2, MIRT is a better option for Level 3 treatment when compared to NTP. If a 
patient's symptom improves in Level 1 and remains unsatisfactory in Level 2, MIRT 
or NTP have similar effect as Level 3 treatment. 

6 Discussion 

In this article, we propose a penalized Q-learning framework and an individual se- 
lection procedure for developing optimal dynamic treatment regimes. Statistical in- 
ference for parameters at each stage are established. The long-term difficulty in 
developing optimal dynamic treatment regimes — non-regularity associated with the 
treatment effect parameters — is solved. The methods are shown to be effective and 
the standard errors are estimated computationally efficiently and with good accuracy. 

The proposed concept of individual selection is generally applicable. Specifically, 
the Q-learning approach is an inefficient special case of the doubly robust structural 
nested mean model (drSNMM) proposed by Robins (2004). The drSNMM is an 
estimating equation approach, which also has the difficulty of non-regularity. The 
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PQ-learning approach proposed here can be straightforwardly extended to penalized 
drSNMM to handle the non-regularity issue. 

Although the linear model form of the Q-functions presented here is an important 
first step, as well as being useful for illustrating the ideas of this paper, this form 
may not be sufficiently flexible for certain practical settings. Semiparametric models 
are a potentially very useful alternative in many such settings because such models 
involve both a parametric component which is usually easy to interpret and a non- 
parametric component which allows greater flexibility. Generalizations of Q-functions 
to allow diverse data such as ordinal outcome, censored outcome and semiparametric 
modeling, are thus future research topics of practical importance. 

The current theoretical framework is based on discrete covariates. This condition 
is not as restricted as it looks. For example, in a practical two-stage setting where 
continuous covariates are presented, unless in the rare case where the parameter 
V>2o i s zer °) the "problematic" set = {i : V^o^i = 0} will not have positive 
probability. Having said that, we can always discretize the continuous covariates into 
several strata and apply the proposed methods with these strata. Obviously, there 
will be loss of information with this approach. Future research to extend our work to 
continuous covariates would also be very useful in practice. Likewise, the current PQ- 
learning framework works for two-level treatments. The generalization to multilevel 
treatments will be a natural and useful next step. 

In many clinical studies, the state space is often of very high dimension. To 
develop optimal dynamic treatment regimes in this case, it will be important to 
develop simultaneous variable selection (for state variables) and individual selection. 
More modern machine learning techniques such as support vector regression and 
random forests can be nested into our PQ-learning framework as powerful tools to 
develop optimal dynamic treatment regimes. 
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